# Pseudo-Differential Sensing Framework for STT-MRAM: A Cross-Layer Perspective

Wang Kang, Member, IEEE, Liang Chang, Student Member, IEEE, Zhaohao Wang, Member, IEEE, Weifeng Lv, Guanyu Sun, Member, IEEE, and Weisheng Zhao, Senior Member, IEEE

Abstract—With the rapid increase of leakage currents, non-volatile memories have become competitive candidates for next-generation computer architectures. Among them, STT-MRAM shows great promise as working memory thanks to its high density, high speed and high endurance. However, based on our investigations, the dynamic write power and read reliability are two critical challenges of STT-MRAM. In this work, we propose a synergistic pseudo-differential sensing (PDS) framework that employs device, circuit and architectural techniques to address these challenges. Specifically, three design techniques, namely cell cluster, asymmetric sensing amplifier and self-error-detection-correction, are proposed to implement the PDS framework. We show that the holistic device-circuit-architecture cross-layer co-design enables STT-MRAM to be utilized in the cache memory, benefiting from the improved density, reliability and energy-efficiency. Our experimental results show that the proposed PDS scheme improves the read margin by $\sim$35.6 percent and reduces the area, read latency, read energy, write latency and write power by $\sim$46.7, $\sim$9.8, $\sim$30.3, $\sim$2.3 and $\sim$31.1 percent, respectively, compared with the typical 1T1MTJ cell structure for a cache capacity of 8 MB. In addition, the proposed PDS scheme reduces the dynamic energy by $\sim$32.9 percent and leakage energy by $\sim$830 percent, and improves the IPC by $\sim$1.3 percent and miss rate by $\sim$36.9 percent, respectively, compared with a conventional SRAM-based cache.

Index Terms—Asymmetric sensing, dynamic write power, read reliability, STT-MRAM

# 1 INTRODUCTION

With the continuous scaling of process technology, leakage-current-induced static power consumption and reliability issues have become critical challenges for conventional semiconductor technology, greatly limiting its scalability. Moreover, the increasing data throughput requirement drives more and more processing cores and memories onto a single chip, further exacerbating these challenges [1], [2], [3]. Recently, emerging nonvolatile memory technologies, such as phase-change memory (PCM), spin transfer torque magnetic random access memory (STT-MRAM) and resistive random access memory (RRAM), have attracted considerable attention and been extensively studied in academia and industry [4], [5], [6]. Thanks to the nonvolatility, data can be retained even when the supply power is off. We can therefore cut off the supply power when the system is idle, significantly reducing the static energy consumption. In particular, STT-MRAM, which employs a bi-directional current to write data into a memory cell, has

W. Kang and W. Lv are with Spintronics Interdisciplinary Center and School of Computer Science and Engineering, Beihang University, Beijing 100191, China. E-mail: wang.kang@buaa.edu.cn, lwf@nlsde.buaa.edu.cn.

Manuscript received 31 May 2016; revised 11 Aug. 2016; accepted 12 Aug. 2016. Date of publication 17 Aug. 2016; date of current version 16 Feb. 2017. Recommended for acceptance by S. Y. Huang.

For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TC.2016.2601330

shown great potential for next-generation computer architectures. In addition to nonvolatility, STT-MRAM provides high integration density close to that of DRAM, as well as fast access speed and practically unlimited endurance comparable to those of SRAM [7], [8]. These advantageous features make STT-MRAM a promising nonvolatile candidate to replace volatile SRAM (or DRAM) in working memory applications.

Although very attractive, current STT-MRAM technology still faces several challenges before wide adoption. The first key challenge of STT-MRAM is the dynamic write power consumption, which is much higher than that of SRAM [9], [10]. STT-MRAM utilizes a bi-directional current to write data into a memory cell (i.e., a magnetic tunnel junction, MTJ) through the STT mechanism. First, the STT-driven MTJ switching requires a relatively large current to overcome the energy barrier between the two stable states of the MTJ device. Second, the STT switching mechanism is intrinsically stochastic and the actual time to complete a write operation varies dramatically among the memory cells of a chip; a long write current pulse is therefore required for a reliable write operation when the worst-case corner is taken into consideration. In addition, the process-voltage-temperature (PVT) variations further increase the cell-to-cell and cycle-to-cycle variations. These factors result in a huge dynamic write power overhead for a frequently-accessed working memory. It has been shown that the dynamic write power per bit of STT-MRAM is one or two orders of magnitude higher than that of SRAM, as shown in Fig. 1 [9], [10]. The second challenge of STT-MRAM is the poor read reliability. This can be explained from two perspectives: First, the intrinsic tunnel magneto-resistance (TMR) ratio of MTJ is generally

L. Chang, Z. Wang and W. Zhao are with Fert Beijing Institute, Spintronics Interdisciplinary Center, Beihang University, Beijing 100191, China.
 E-mail: gsun@pku.edu.cn, {zhaohao.wang, weisheng.zhao}@buaa.edu.cn.

G. Sun is with the Center for Energy-efficient Computing and Applications, Peking University, Beijing 100871, China. E-mail: gsun@pku.edu.cn.



Fig. 1. The energy-delay performance comparison between STT-MRAM and conventional CMOS technology (or SRAM).

limited (typically < 300 percent at room temperature) by the device materials and structures, resulting in limited read margin (RM). Second, the inevitable PVT variations of both the MTJs and CMOS transistors further degrade the RM [11], [12]. Increasing the read current can improve the RM to some extent, but a large read current also leads to a high read disturbance (RD) probability, again affecting the read reliability. Here RD is defined as the unintentional switching of the MTJ during a read operation. Therefore, there exists a design trade-off between RM (or decision error) and RD, as shown in Fig. 2. In addition, as process technology scales down, on the one hand, the critical switching current density for MTJ switching reduces, which means the read current should decrease accordingly to avoid RD; on the other hand, the PVT variations grow rapidly, so reducing the read current unavoidably degrades the RM. The read reliability issue has become a new critical barrier for nanoscale STT-MRAM [13].

Until now, extensive studies have been carried out to address the dynamic write power and the read reliability challenges of STT-MRAM. To reduce the dynamic write power of STT-MRAM, two strategies can be employed. The first one is to reduce the required write current amplitude and pulse duration [14], [15] from the device design point of view. This strategy is direct and effective, but it is generally limited by the device material and structure. The second one is to reduce the write operation frequency through circuit or architecture level design techniques, e.g., read-modify-write and bypassing algorithms [16], [17], [18]. To improve the read reliability, enlarging the intrinsic TMR ratio of MTJ is the most efficient solution; however, the achievable improvement is very limited at room temperature with current technology [19]. Circuit-level design techniques are therefore preferable. For example, the self-reference sensing (SRS) scheme, including destructive and non-destructive variants, can reduce the impact of PVT variations by eliminating the reference cell. However, the destructive SRS scheme needs to restore the data back to the memory cell after each read operation, wasting much dynamic energy, while the non-destructive one achieves only limited RM [20], [21]. Differential sensing (DS) is an effective solution to improve the RM; however, it requires two complementary memory



Fig. 2. The design trade-off between the decision error (or RM) and read disturbance. This issue is more serious as technology scales.

cells to store only one bit of data, introducing large area and power overheads [22], [23].

As can be seen, these previous studies try to solve the challenges of STT-MRAM with single-layer techniques, which are ineffective in practice. In addition, these studies ignore the fact that the dynamic write power and read reliability of STT-MRAM are actually interrelated. For example, lowering the critical switching current density of MTJ can reduce the dynamic write power, which, however, increases the RD probability [24]. Therefore, cross-layer solutions are preferable for addressing these challenges. In this work, we propose a synergistic pseudo-differential sensing (PDS) framework for STT-MRAM to jointly address the dynamic write power and read reliability concerns from a holistic device-circuit-architecture cross-layer co-design perspective. In order to implement the PDS framework, we propose three techniques, namely the cell cluster structure, the asymmetric sensing amplifier (ASA), and self-error-detection-correction (SEDC), at the device level, circuit level and architecture level, respectively. Our experimental results show that the proposed PDS framework achieves good performance in terms of area, dynamic power, latency and reliability. The main contributions of this paper are summarized as follows:

- We propose a synergistic PDS framework to jointly address the dynamic write power and read reliability concerns of STT-MRAM from a holistic cross-layer co-design perspective.
- We propose a cell cluster structure to re-organize the data bits stored in the memory cells. This cell cluster structure adds redundancy to improve read reliability as well as to reduce write frequency.
- We propose a novel ASA, which utilizes the DS concept, to read out the data stored in the cell cluster, greatly improving the RM.
- We propose an SEDC module combined with the cell cluster structure and ASA to utilize the redundancy for detecting or correcting errors.
- We develop a cross-layer evaluation platform and evaluate the performance of the proposed PDS technique in STT-MRAM.



Fig. 3. Typical 1T1MTJ bit-cell structure and writing method.

The remainder of this paper is organized as follows. Section 2 introduces the backgrounds and fundamentals of STT-MRAM. In Section 3, we introduce the critical challenges of STT-MRAM and our motivation. In Section 4, we present the concept and implementation of our proposed PDS framework. The cell cluster structure, ASA and SEDC module are also presented in this section. The evaluation platform and experimental results are reported and analysed in Section 5. A summary of the related work is given in Section 6, followed by the conclusion of this paper in Section 7.

# 2 BACKGROUNDS OF STT-MRAM

A typical STT-MRAM bit-cell consists of an MTJ connected in series with an NMOS access transistor, named the 1T1MTJ cell structure [7], as shown in Fig. 3. The MTJ is the core device for storing the data bit, while the NMOS transistor acts as an access control device and provides the write/read driving currents. A collection of 1T1MTJ cells, combined with the peripheral circuits, then forms a memory array. Each memory bit-cell can be randomly accessed by controlling the bit-line (BL), word-line (WL) and source-line (SL). An MTJ is mainly composed of three ultra-thin layers: one oxide barrier layer (e.g., MgO) sandwiched between two ferromagnetic (FM) layers (e.g., CoFeB). Generally, the magnetization orientation of one FM layer is fixed (named pinned layer, PL) while that of the other FM layer (named free layer, FL) can be changed by a magnetic field, charge current or voltage. Depending on the relative magnetization orientation (parallel (P) or anti-parallel (AP)) of the two FM layers, an MTJ presents two stable resistance states (i.e., low resistance, $R_P$, and high resistance, $R_{AP}$). Therefore each MTJ can store one bit of data. The resistance difference between the two stable resistance states is characterized by the TMR ratio, i.e., $TMR = (R_{AP} - R_P)/R_P$. By using the STT mechanism [25], only a bi-directional charge current is required to write data into the memory cell (i.e., MTJ). On the other hand, thanks to the TMR effect, the data bit stored in the memory cell can be read out by distinguishing the two different resistance states of the MTJ utilizing voltage or current sensing techniques.

The magnetization of the FM layers can be either in the film plane or perpendicular to the film plane, which defines two types of MTJs: in-plane MTJ and perpendicular MTJ. Physical mechanisms and device properties differ widely between these two types of MTJs. Specifically, the in-plane MTJ depends mainly on the shape anisotropy, whereas the perpendicular MTJ relies mainly on the bulk (or interfacial) perpendicular magnetic anisotropy (PMA), which originates from the bulk of the magnetic material (or from the interface or surface) [26], [27]. Because of the different physical mechanisms, the device properties, such as thermal stability and critical switching current, which characterize the data retention and writability of MTJ respectively, also differ between the two types of MTJs. In general, perpendicular MTJ, especially the interfacial perpendicular one, outperforms in-plane MTJ as technology scales down. Based on our investigations, it is challenging for the in-plane MTJ to shrink down to 30 nm and below, due to the requirement of a relatively large aspect ratio (AR, defined as the ratio between the length and width of the MTJ) to maintain the shape anisotropy and thermal stability factor. In addition, the critical switching current of in-plane MTJ is proportional to the thermal stability and MTJ volume, resulting in high dynamic write energy. These problems greatly limit the use of in-plane MTJs in high-density and energy-efficient applications. In contrast, recent experiments have demonstrated advanced perpendicular MTJ devices down to sub-10 nm with good thermal stability and writability [28], [29], [30]. Owing to their promising potential in nanoscale memory and logic applications, current academic and industrial research focuses mainly on perpendicular MTJs. This work also considers perpendicular MTJs in STT-MRAM.

# 3 STT-MRAM CHALLENGES AND MOTIVATION

In this section, we first provide our detailed analyses of the STT-MRAM challenges. Based on the analyses, we then present our motivation to address these challenges.

# 3.1 Dynamic Write Power Concern

The dynamic write power concern of STT-MRAM mainly comes from three factors. The first one is that the switching of MTJ by using the STT effect requires a relatively high current density to overcome the energy barrier,  $E_b$  (or thermal stability,  $\Delta$ ) between the two stable resistance states. The critical switching current density ( $J_C$ ) for the perpendicular MTJ can be expressed as [31], [32],

$$J_C = \frac{2e}{\hbar} \frac{\alpha}{\eta} (M_s t_F) H_K. \tag{1}$$

$$\Delta = E_b / k_B T = H_K M_s V / 2k_B T, \tag{2}$$

where e is the elementary charge, $\hbar$ the reduced Planck constant, $\eta$ the spin transfer efficiency, $\alpha$ the Gilbert damping constant, $M_s$ the saturation magnetization, $t_F$ the thickness of the FL, $H_K$ the effective anisotropy field, V the volume of the FL of MTJ, $k_B$ the Boltzmann constant and T the temperature. Typical values of the critical current density and pulse duration for MTJ switching at 40 nm are several MA/cm² and several nanoseconds, respectively. As a result, the dynamic write power consumption per bit of STT-MRAM is one or two orders of magnitude higher than that of SRAM (see Fig. 1).
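As a rough numerical illustration of Eqs. (1) and (2), the following sketch evaluates both expressions for one set of device parameters. It assumes the equations are written in Gaussian units (the unit system is not stated here), and the function names and all parameter values in the usage note are hypothetical, chosen only to be of a plausible order for a 40 nm perpendicular MTJ.

```python
import math

# Constants for Gaussian-unit evaluation (charge kept in coulombs so that
# the current density comes out in A/cm^2). Unit system is an assumption.
E_CHARGE = 1.602176634e-19   # C
HBAR = 1.054571817e-27       # erg*s
K_B = 1.380649e-16           # erg/K


def critical_current_density(alpha, eta, m_s, t_f, h_k):
    """Eq. (1): J_C in A/cm^2, with M_s in emu/cm^3, t_F in cm, H_K in Oe."""
    return (2.0 * E_CHARGE / HBAR) * (alpha / eta) * (m_s * t_f) * h_k


def thermal_stability(h_k, m_s, volume, temperature):
    """Eq. (2): Delta = H_K * M_s * V / (2 k_B T), with V in cm^3."""
    return h_k * m_s * volume / (2.0 * K_B * temperature)


def free_layer_area(diameter):
    """Area of a circular free layer (hypothetical geometry), in cm^2."""
    return math.pi * (diameter / 2.0) ** 2
```

For example, with the hypothetical values `alpha=0.01`, `eta=0.3`, `m_s=1000`, `t_f=1.3e-7` and `h_k=5000`, `critical_current_density` lands in the several-MA/cm² range quoted above. Multiplying Eq. (1) by the free-layer area also shows that the critical current scales linearly with $\Delta$, which is why a more thermally stable cell is more expensive to write.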

The second factor comes from the asymmetry, including the STT-driven MTJ switching asymmetry as well as the 1T1MTJ bit-cell asymmetry. The STT asymmetry is intrinsically physical, as the spin transfer efficiency  $(\eta)$  is mainly



Fig. 4. The dynamic write power distributions of the 1T1MTJ bit-cell at the 40 nm technology node.

determined by the relative magnetization orientation of the two FM layers of MTJ, expressed as [33],

$$\eta = (P/2)/(1 + P^2 \cos \theta), \tag{3}$$

$$J_{P-AP}/J_{AP-P} = (1+P^2)/(1-P^2), \tag{4}$$

where P is the tunneling spin polarization, $\theta$ is the angle between the magnetization orientations of the two FM layers, and $J_{P-AP}$ and $J_{AP-P}$ are the critical current densities for the MTJ switching operations of $P \rightarrow AP$ and $AP \rightarrow P$, respectively. We found that $J_{P-AP}$ is considerably higher than $J_{AP-P}$ (about 1 to 1.5 times higher). The 1T1MTJ bit-cell asymmetry is due to the source degeneration of the access NMOS transistor, which reduces the driving capability of the NMOS transistor because different bias conditions are utilized for writing data bits "0" and "1" [33]. As shown in Fig. 3, for writing data bit "0", i.e., AP-P switching, the WL and BL are connected to the supply voltage ($V_{dd}$) while the SL is connected to the ground. The gate-source voltage ($V_{GS}$) of the NMOS transistor is $V_{dd}$. However, for writing data bit "1", i.e., P-AP switching, the WL and SL are connected to $V_{dd}$ while the BL is connected to the ground. In this case, $V_{GS}$ equals $(V_{dd} - I_{MTJ} \cdot R_{MTJ})$, where $I_{MTJ}$ is the write current flowing through the MTJ and $R_{MTJ}$ is the resistance value of the MTJ. The reduction of $V_{GS}$ decreases the driving capability of the NMOS transistor. In practice, the worst-case corner must be considered for reliable write operation; therefore, the write current pulse duration has to be increased, which wastes dynamic write power.
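The link between Eqs. (3) and (4) can be checked numerically with a minimal sketch. It assumes that $J_C$ is inversely proportional to $\eta$, as in Eq. (1), and that the initial state fixes the angle ($\theta = 0$ when starting from P, $\theta = \pi$ when starting from AP). The function names are illustrative only.

```python
import math


def spin_transfer_efficiency(p, theta):
    """Eq. (3): eta = (P/2) / (1 + P^2 cos(theta))."""
    return (p / 2.0) / (1.0 + p * p * math.cos(theta))


def switching_asymmetry(p):
    """J_{P-AP} / J_{AP-P}, assuming J_C is proportional to 1/eta."""
    eta_from_p = spin_transfer_efficiency(p, 0.0)        # start in P state
    eta_from_ap = spin_transfer_efficiency(p, math.pi)   # start in AP state
    # J_{P-AP} ~ 1/eta_from_p and J_{AP-P} ~ 1/eta_from_ap
    return eta_from_ap / eta_from_p
```

For any polarization $0 < P < 1$ the ratio exceeds one and coincides with the closed form of Eq. (4), so the $P \rightarrow AP$ transition is always the more expensive one under these assumptions.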

The third factor comes from the stochastic property of the STT mechanism and the process-voltage-temperature variations. The stochastic magnetization dynamics of the FL of MTJ under the STT effect can be characterized by solving the Landau-Lifshitz-Gilbert (LLG) equation, taking into consideration the random thermal effect [25], [34],

$$dm_f/dt = \gamma m_f \times (H_{eff} + H_{fluc}) - \alpha m_f \times (m_f \times (H_{eff} + H_{fluc})) + \rho_{stt} \, m_f \times (m_f \times m_p), \tag{5}$$

where $\gamma$ is the gyro-magnetic constant, $m_f$ and $m_p$ are the unit magnetization vectors of the FL and PL respectively, $H_{eff}$ the effective magnetic field, $H_{fluc}$ the thermally induced random field, $\rho_{stt} = \gamma \hbar P J_{stt} / (2e \mu_0 t_f M_s)$ the STT factor, $\hbar$ the reduced Planck constant, P the STT polarization, $J_{stt}$ the current density inducing the STT effect, e the elementary charge, $\mu_0$ the vacuum permeability, $t_f$ the thickness of the free layer and $M_s$ the saturation magnetization. The stochastic switching behaviours of MTJ introduce cycle-to-cycle variations. In addition, the PVT variations of the MTJs and CMOS transistors further add cell-to-cell variations across the whole memory array. Again, much energy is wasted to cover the worst case of the chip. Taking all the above-mentioned concerns into consideration, Fig. 4 shows our simulation results of the dynamic write power distributions for the 1T1MTJ bit-cell at the 40 nm node. As expected, writing data bit "1" requires much higher average dynamic write power and exhibits a wider distribution than writing data bit "0".

# 3.2 Read Reliability Concern

Typically, the data stored in an STT-MRAM bit-cell is read out by comparing its resistance state with that of a pre-known reference cell. Here the resistance of the MTJ in the reference cell ($R_{ref}$) is generally set as the average value of $R_P$ and $R_{AP}$, i.e., $R_{ref} = 0.5(R_P + R_{AP})$. First, a bias voltage ($V_{BL}$) is applied on the BLs of the data and reference cells to convert the MTJ resistance states of the data and reference cells to currents ($I_{data}$ and $I_{ref}$). Then the sensed currents are converted to voltages ($V_{data}$ and $V_{ref}$) by the load of the read circuit. Finally, a digital output signal is generated by comparing $V_{data}$ and $V_{ref}$. The above-mentioned signals are expressed as,

$$I_{data} = V_{b\_data} / R_{data}, \quad I_{ref} = V_{b\_ref} / R_{ref}, \tag{6}$$

$$V_{data} = I_{data} \times R_{l\_data}, V_{ref} = I_{ref} \times R_{l\_ref}, \tag{7}$$

where  $V_{b\_data}$  and  $V_{b\_ref}$  are the bias voltages applied on the MTJs of the data and reference cells,  $R_{data}$  is the resistance value of MTJ in the data cell and can be either  $R_P$  or  $R_{AP}$  depending on the data bit ("0" or "1") stored in the MTJ,  $R_{l\_data}$  and  $R_{l\_ref}$  are the output resistances of the loads of the read circuit in the data and reference branches, respectively. Here we assume that  $V_{b\_data} = V_{b\_ref} = V_{bias}$  and  $R_{l\_data} = R_{l\_ref} = R_{load}$  in the ideal case without considering the PVT variations.

As the TMR ratio of MTJ is limited by the device material and structure, the read margin is also limited. Here the RM is defined as the minimum absolute difference between $I_{data}$ (or $V_{data}$) and $I_{ref}$ (or $V_{ref}$), i.e.,

$$RM = \begin{cases} \min\{\left|I_{ref} - I_{data0}\right|, \left|I_{ref} - I_{data1}\right|\}, \text{ or} \\ \min\{\left|V_{ref} - V_{data0}\right|, \left|V_{ref} - V_{data1}\right|\}, \end{cases} \tag{8}$$

where $I_{data0}$ and $I_{data1}$ are the sensed currents flowing through the data cell when it is in the low (i.e., bit "0") and high (i.e., bit "1") resistance states respectively, and $V_{data0}$ and $V_{data1}$ are the corresponding voltages of the data cell. As can be seen, the RM is proportional to the TMR ratio and $V_{bias}$. However, the TMR ratio is intrinsically limited by the MTJ device technology, while $V_{bias}$ is clamped to protect the data and reference cells from RD. This clearly indicates a design trade-off between RM and RD (see Fig. 2). To make things even worse, as technology scales, the critical



Fig. 5. The scaling trends of the critical switching current of MTJ as well as the required read current.

switching current of MTJ reduces dramatically with the MTJ size (see Fig. 5). Normally, the read current (or $V_{bias}$) should decrease correspondingly to ensure a sufficiently low probability of RD ($Pr_{RD}$, see Eq. (9)) [13]. On the other hand, the PVT variations of both MTJs and CMOS transistors increase as technology scales down, so the read current should be increased, rather than decreased, to achieve sufficient RM. As can be seen from Fig. 5, when technology scales below 30 nm, the required read current approaches the critical switching current of MTJ. In this case, it is rather difficult to achieve a target RM with no RD. Therefore, the readability concern has become a new critical challenge for nanoscale STT-MRAM.

$$Pr_{RD} = 1 - \exp\left\{-\frac{t_{read}}{\tau_0} \exp\left[-\Delta\left(1 - \frac{I_{read}}{I_C}\right)\right]\right\}, \tag{9}$$

where  $I_{read}$  and  $t_{read}$  are the read current pulse amplitude and duration respectively,  $\tau_0$  the attempt period and  $I_C$  the critical switching current amplitude of MTJ.
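To make the sensitivity of Eq. (9) to the read current concrete, the expression can be evaluated directly, as in the minimal sketch below. The function name is illustrative and the default attempt period of 1 ns is an assumed typical value, not a figure taken from this work.

```python
import math


def read_disturb_probability(i_read, t_read, delta, i_c, tau0=1e-9):
    """Eq. (9): probability that a read pulse unintentionally switches the MTJ.

    i_read and i_c share the same unit; t_read and tau0 are in seconds.
    tau0 = 1 ns is an assumed default.
    """
    rate_factor = math.exp(-delta * (1.0 - i_read / i_c))
    # 1 - exp(-x) is computed as -expm1(-x) so that very small
    # probabilities are not rounded to zero.
    return -math.expm1(-(t_read / tau0) * rate_factor)
```

Because the read current appears inside the inner exponent, a modest increase of $I_{read}/I_C$ raises $Pr_{RD}$ by many orders of magnitude, which is the quantitative root of the RM versus RD trade-off discussed above.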

# 3.3 Motivation

Based on the above analyses, we have the following findings: a) For the dynamic write power concern, the critical switching current and the stochasticity of the STT-driven MTJ switching mechanism are mainly determined by the intrinsic device properties as well as the manufacturing process, about which little can be done from the design point of view. However, the power consumption due to the bit-cell asymmetry and PVT variations can be significantly reduced through circuit or architecture level techniques, by exploiting the behavior characteristics. As writing data bit "1" (i.e., $P \rightarrow AP$ switching of MTJ) consumes much more power than writing bit "0" (i.e., $AP \rightarrow P$ switching of MTJ), we can reduce the number of bit-"1" write operations as much as possible; b) For the readability concern, the conflict between RM and RD is indeed a big challenge in nanoscale technology nodes. One solution that can improve RM without increasing RD is the differential sensing scheme. In the DS scheme, two complementary cells



Fig. 6. The overall schematic of the proposed PDS framework.

(one in the "P" state and the other in the "AP" state, or vice versa) are utilized to store one single bit of data, and the sensed voltage (or current) difference between the two cells is compared directly. The RM of the DS scheme is RM = $|I_{data0} - I_{data1}|$ or $|V_{data0} - V_{data1}|$, which is theoretically double that of the typical scheme (see Eq. (8)) under the same bias voltage. Meanwhile, the read current of the DS scheme remains the same as that of the typical one, leading to no increase of RD probability. The main problems of the DS scheme are its area and power overheads, as it requires two 1T1MTJ bit-cells (named 2T2MTJ cell) to store one single bit of data. If the area and power overheads can be reduced, the DS scheme is a preferable solution in deeply scaled technology nodes; c) We find that the dynamic write power and read reliability concerns are actually interrelated, which suggests dealing with the two concerns simultaneously. Based on these findings, we propose a synergistic pseudo-differential sensing framework to jointly address the challenges of STT-MRAM from a holistic device-circuit-architecture cross-layer co-design perspective.

# 4 PSEUDO-DIFFERENTIAL SENSING FRAMEWORK

The overall schematic of the proposed PDS framework is shown in Fig. 6. It comprises three techniques: the cell cluster structure, the asymmetric sensing amplifier and the self-error-detection-correction module. Within the proposed PDS framework, the data representation, memory access operation and peripheral sensing amplifier should all be re-designed, as presented in detail below.

# 4.1 Cell Cluster Structure

In the proposed cell cluster structure, several (e.g., three) MTJs are grouped together to represent one data symbol of two bits (see Table 1) and the data bits are read out by comparing the resistance states of the MTJs within the cell cluster through the DS scheme. As we utilize three MTJs (with eight resistance states) to represent only two data bits (four symbol states), two different resistance states of the cell cluster are used to stand for the same data symbol. This is named state-restrict mapping (abbreviated as SR-mapping). Specifically, the resistance states {P-P-P} and {AP-AP-AP} stand for data symbol (00), {P-P-AP} and {P-AP-AP} stand for (01), {P-AP-P} and {AP-AP-P} stand for (10), and

TABLE 1

Data Representation in the PDS Framework

| Data Symbols (2 bits) | MTJ Resistance States in Cell Cluster<br>{MTJ0, MTJ1, MTJ2} | Outputs of ASAs<br>ASA0, ASA1, ASA2 |
|-----------------------|------------------------------------------|------------------|
| (00)                  | {P-P-P}<br>{AP-AP-AP}                    | [0-0-0]          |
| (01)                  | {P-P-AP}<br>{P-AP-AP}                    | [0-0-1]          |
| (10)                  | {P-AP-P}<br>{AP-AP-P}                    | [0-1-0]          |
| (11)                  | {AP-P-P}<br>{AP-P-AP}                    | [1-0-0]          |

{AP-P-P} and {AP-P-AP} stand for (11). Based on the analyses in Section 3, we know that writing data bit "1" (i.e., AP) consumes much more power than writing data bit "0" (i.e., P). We can therefore choose the cell cluster resistance states with fewer "AP" MTJs to represent the data symbols, i.e., $\{P-P-P\}\leftrightarrow(00)$, $\{P-P-AP\}\leftrightarrow(01)$, $\{P-AP-P\}\leftrightarrow(10)$ and $\{AP-P-P\}\leftrightarrow(11)$. As a result, any write operation of the cell cluster involves at most one $P \to AP$ and one $AP \to P$ transition (see Fig. 7a). Another strategy, named the two-step write strategy, is to first reset the cell cluster to the {P-P-P} state and then write the target state based on the new data symbol (see Fig. 7b). As can be seen, in both strategies, the required average energy per bit of the proposed cell cluster is the same as that in the typical 1T1MTJ memory cell. On the other hand, the read reliability (or RM) can be greatly improved, since the DS scheme is employed to read out the data bits stored in the cell cluster. In addition, since two different resistance states of the cell cluster represent the same data symbol (see Table 1), redundancy is added for improving data robustness, which will be shown in Section 4.3.
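The SR-mapping and the two write strategies described above can be checked with a short sketch. The dictionaries transcribe Table 1 and the minimum-"AP" states named in the text; the function names and the tuple encoding of a cluster state are illustrative choices, not part of the proposed design.

```python
# Minimum-AP cluster state written for each 2-bit symbol.
# Each tuple is (MTJ0, MTJ1, MTJ2); "P" = low, "AP" = high resistance.
WRITE_MAP = {
    "00": ("P", "P", "P"),
    "01": ("P", "P", "AP"),
    "10": ("P", "AP", "P"),
    "11": ("AP", "P", "P"),
}

# Full SR-mapping of Table 1: two cluster states per data symbol.
READ_MAP = {
    ("P", "P", "P"): "00", ("AP", "AP", "AP"): "00",
    ("P", "P", "AP"): "01", ("P", "AP", "AP"): "01",
    ("P", "AP", "P"): "10", ("AP", "AP", "P"): "10",
    ("AP", "P", "P"): "11", ("AP", "P", "AP"): "11",
}


def count_switches(old_state, new_state):
    """Return (number of P->AP, number of AP->P) MTJ switchings."""
    pairs = list(zip(old_state, new_state))
    p_to_ap = sum(1 for o, n in pairs if o == "P" and n == "AP")
    ap_to_p = sum(1 for o, n in pairs if o == "AP" and n == "P")
    return p_to_ap, ap_to_p


def direct_write(old_symbol, new_symbol):
    """Direct strategy (Fig. 7a): go straight to the target state."""
    return count_switches(WRITE_MAP[old_symbol], WRITE_MAP[new_symbol])


def two_step_write(old_symbol, new_symbol):
    """Two-step strategy (Fig. 7b): reset to {P-P-P}, then set the target."""
    reset_state = WRITE_MAP["00"]
    reset_cost = count_switches(WRITE_MAP[old_symbol], reset_state)
    set_cost = count_switches(reset_state, WRITE_MAP[new_symbol])
    return reset_cost, set_cost
```

Enumerating all sixteen symbol pairs confirms that a direct write never needs more than one $P \to AP$ and one $AP \to P$ switching, and that each step of the two-step strategy switches at most one MTJ.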

The cell cluster can be formed through direct combination of three 1T1MTJ bit-cells, named 3T3MTJ cell cluster, which, however, results in a large area penalty. Alternatively, since the MTJs in STT-MRAM are fabricated on top of the NMOS access transistors by using the back-end-of-line (BEOL) process technology, and the bit-cell area mainly depends on the size of the NMOS access transistor [35], we propose to cluster the MTJs within one cell cluster on top of one NMOS access transistor, named 1T3MTJ cell cluster, for area efficiency, as shown in Fig. 8. With this proposed cell cluster, the memory array (which will be shown in Section 4.4) should be reorganized, where the BLs within a local cell cluster form a BL



Fig. 7. The state transition diagram of the proposed cell cluster; (a) Direct write strategy; (b) Two-step write strategy.



Fig. 8. The proposed 1T3MTJ cell cluster structure; (a) circuit symbol; (b) cross-sectional layout view.

cluster and are accessed simultaneously through a global BL (G BL). Under the proposed PDS framework, each 1T3MTJ cell can be treated as a multi-level cell controlled by a G BL, a WL and an SL, similar to the typical 1T1MTJ cell. The three MTJs in the cell cluster can be viewed as one storage element that stores 2 bits of data. One may question whether the NMOS access transistor can provide enough current drivability for writing the three MTJs within the cell cluster at the same time if the size of the transistor is limited. Fortunately, as discussed above, each write operation of the cell cluster involves at most one P-AP and one AP-P switching (see Fig. 7a). In this case, we can either utilize a read-before-write (RBW) method [36], [37], which is widely employed for data write operations in memory, or employ the two-step write strategy shown in Fig. 7b for writing data into the cell cluster. In both strategies, each write operation of the cell cluster involves only one MTJ switching, requiring no additional drivability (or increase of size) of the NMOS access transistor compared with that of the typical 1T1MTJ bit-cell.

# 4.2 Asymmetric Sensing Amplifier

In the DS scheme, the MTJs of the two memory cells (or the two inputs of the sensing amplifier, SA) are always in complementary resistance states; therefore, the SA can output a digital signal by comparing the voltage (or current) difference between the two inputs. In the proposed PDS framework, however, as can be seen from Table 1, the two inputs of the SA may be in the same state. For example, when the cell cluster is in the {P-P-P} state, the three MTJs (inputs of the three SAs) are all in the low resistance state. In this case, the output of the SA is unstable and depends mainly on the PVT variations. To solve this problem, we propose an asymmetric sensing amplifier for the PDS framework, as illustrated in Fig. 9a, in which we intentionally add a pre-known input offset ($\Delta V$) between the two inputs of the SA. This input offset can be introduced, for example, by using different load resistances in the two branches or by pre-charging one of the inputs of the conventional SA. Fig. 10 shows an example of the ASA design with the pre-charging method, in which $\Delta V$ can be changed dynamically by modulating the pre-charge voltage ($V_{pre}$) according to the application. Without loss of generality, we assume that the right input of the ASA always has a positive $\Delta V$ compared to the left input. In this configuration, the ASA outputs a digital bit "0" if the potential of the right input is higher than that of the left input; otherwise, it outputs



Fig. 9. (a) The concept of the proposed ASA design; (b)-(e) Four possible input cases and the corresponding outputs.

a digital bit "1" if the potential of the right input is smaller than that of the left input.

As a result, there are three cases of outputs for the proposed ASA (see Figs. 9b, 9c, 9d, 9e): (a) The MTJ resistance states for the two inputs of the ASA are the same, i.e., {P-P} or {AP-AP}; then the right input of the ASA has a larger potential than the left input and the ASA outputs a digital bit "0". The RM depends on the amplitude of $\Delta V$. (b) The MTJ resistance states for the two inputs of the ASA are {P-AP}; then the ASA outputs a digital bit "0" and the RM is $(|V_{data1} - V_{data0}| + \Delta V)$, which is larger than that of the DS scheme by $\Delta V$. (c) The MTJ resistance states for the two inputs of the ASA are {AP-P}; then the ASA outputs a digital bit "1" and the RM is $(|V_{data1} - V_{data0}| - \Delta V)$, which is less than that of the DS scheme by $\Delta V$. In summary, the digital outputs of the three ASAs for all the resistance states of the cell cluster are as follows: [000] $\leftrightarrow$ {P-P-P, AP-AP-AP}, [001] $\leftrightarrow$ {P-P-AP, P-AP-AP}, [010] $\leftrightarrow$ {P-AP-P, AP-AP-P}, [100] $\leftrightarrow$ {AP-P-P, AP-P-AP}, as shown in Table 1. The design challenge then becomes how to set the value of $\Delta V$ to achieve the optimal RM. An intuitive value is $\Delta V = (V_{data0} + V_{data1})/2$ to trade off among all the three cases; the average RM of the proposed ASA is then exactly the same as that of the typical SA when no PVT variations are considered (ideal case). This is the origin of the name "pseudo-DS". Fortunately, even in this case, the PDS (or ASA) outperforms the typical SA when PVT variations are taken into consideration, for the following three reasons: (a) In the proposed PDS scheme, there is no need for a reference cell. All the data cells at the same time act as reference cells



Fig. 10. The schematic of the proposed ASA design with the pre-charging method as an example.



Fig. 11. (a) The schematic of the SEDC module combined with the ASA design; (b) The implementation of the majority voter.

for each other in a local cell cluster. Therefore, there is no regularity problem, which denotes the process parameter difference between the data and reference cells. (b) The proposed ASA only involves MTJs locally in a cell cluster; it therefore has the advantage of little parasitic mismatch compared with the typical SA, in which the reference cell is in general globally shared by all the data cells along the BL or WL. (c) Three outputs are obtained to represent only two data bits, which adds redundancy for further error detection and correction (shown in Section 4.3). In practical applications, we can first test the STT-MRAM chip after fabrication and then set the value of $\Delta V$ adaptively according to the real PVT variations in the chip.
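The decision rule of the ASA and the three read-margin cases can be checked with a small behavioral model. This is only a sketch under stated assumptions: the input voltages `V_P` and `V_AP` and the offset `DELTA_V` are illustrative integers in millivolts, not values from the design, and the cyclic pairing of the three MTJs onto the three ASAs is inferred from the output patterns listed in this section rather than taken from the schematic.

```python
from itertools import product

# Hypothetical sensing voltages in mV (illustrative only, not design values).
V_P, V_AP = 100, 200   # input voltage produced by a P (low-R) / AP (high-R) MTJ
DELTA_V = 50           # intentional input offset added to the right input

def asa_output(v_left, v_right, delta_v=DELTA_V):
    """Return 0 if the offset right input is higher than the left input, else 1."""
    return 0 if v_right + delta_v > v_left else 1

def read_margin(v_left, v_right, delta_v=DELTA_V):
    """Distance between the two effective inputs seen by the comparator."""
    return abs(v_right + delta_v - v_left)

def cluster_outputs(state):
    """Outputs of the three ASAs for a cluster state such as ('P', 'AP', 'P').

    Assumed (left, right) pairing: (M1, M2), (M2, M3), (M3, M1).
    """
    v = [V_AP if s == "AP" else V_P for s in state]
    pairs = [(0, 1), (1, 2), (2, 0)]
    return [asa_output(v[left], v[right]) for left, right in pairs]

if __name__ == "__main__":
    for state in product(("P", "AP"), repeat=3):
        print("-".join(state), cluster_outputs(state))
    print("RM, same state:", read_margin(V_P, V_P))
    print("RM, {P-AP}:", read_margin(V_P, V_AP))
    print("RM, {AP-P}:", read_margin(V_AP, V_P))
```

With these illustrative numbers the same-state margin equals `DELTA_V`, the {P-AP} margin is the voltage difference plus `DELTA_V`, and the {AP-P} margin is the voltage difference minus `DELTA_V`, which mirrors cases (a) to (c).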

# 4.3 Self-Error-Detection-Correction

By combining the proposed cell cluster structure and the ASA design, the PDS framework also provides self-error-detection-correction capability. First, we discuss the self-error-correction (SEC) capability. Since two different resistance states of the cell cluster are used to represent one single data symbol (see Table 1), the correct data can still be read out if an error causes a transition exactly between these two states. For example, assume that the data symbol is (01) and the initial resistance state of the cell cluster is {P-P-AP}; the outputs of the three ASAs are then supposed to be [001]. If the resistance state of the middle MTJ flips from P to AP because of write failures, radiation particles, thermal fluctuation or any other possible errors, the resistance state of the cell cluster becomes {P-AP-AP}. Fortunately, the stored data symbol can still be correctly read out as [001], thanks to the redundancy of the cell cluster as well as the ASA design. As can be seen, this SEC process is automatic and transparent to users. Nevertheless, the error correction capability is limited to specific error patterns.
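Which single-MTJ flips are absorbed in this way can be enumerated directly. The sketch below is an illustration only, assuming an ideal (noise-free) comparison and the cyclic ASA pairing inferred from the output patterns in Section 4.2; the helper names are hypothetical.

```python
def asa_codes(state):
    """Ideal ASA outputs for a cluster state, e.g. ('P', 'P', 'AP').

    An ASA outputs 1 only when its left MTJ is AP and its right MTJ is P.
    Assumed (left, right) pairing: (M1, M2), (M2, M3), (M3, M1).
    """
    pairs = [(0, 1), (1, 2), (2, 0)]
    return [1 if state[left] == "AP" and state[right] == "P" else 0
            for left, right in pairs]

def flip(state, index):
    """Return a copy of the state with one MTJ toggled between P and AP."""
    toggled = list(state)
    toggled[index] = "P" if toggled[index] == "AP" else "AP"
    return tuple(toggled)

def absorbed_flips(state):
    """Indices of single-MTJ flips that leave the read-out code unchanged."""
    reference = asa_codes(state)
    return [i for i in range(3) if asa_codes(flip(state, i)) == reference]

if __name__ == "__main__":
    stored = ("P", "P", "AP")
    print(asa_codes(stored), absorbed_flips(stored))
```

For the state {P-P-AP}, only the flip of the middle MTJ leaves the code unchanged, which matches the example above and illustrates why the correction capability depends on the error pattern.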

Next, we discuss the self-error-detection (SED) capability of the proposed PDS framework. As can be seen from Table 1, for all the resistance states of the cell cluster, there are four combinations of outputs for the three ASAs, including [000], [001], [010], and [100]. We can find that at most one output bit "1" is valid for a correct data symbol. With this finding, we can introduce the SED functionality into the PDS framework by only adding a majority voter, as shown in Fig. 11. If two or more ASAs output digital bits "1" due to any possible errors, the SEDC module is able to detect the error and output an acknowledge signal (ACK). Based on



Fig. 12. (a) The overall PDS framework for STT-MRAM; (b) Typical 1T1MTJ cell array; (c) The 3T3MTJ cell cluster array; (d) The 1T3MTJ cell cluster array; (e) layout of the 1T1MTJ cell array ($2 \times 4$) for storing 8 bits; (f) layout of the 1T3MTJ cell array ($2 \times 2$) for storing 8 bits.

this ACK signal, the error bit can be located and corrected accordingly, or further architecture techniques can be employed, e.g., error correction coding or checkpointing, which are outside the scope of this paper.
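The detection rule can be written as a three-input majority function. The following is a behavioral sketch only: the AND-OR form is the textbook majority function and is assumed here for illustration, whereas the actual voter circuit is the one given in Fig. 11b. The function names are hypothetical.

```python
def majority(a, b, c):
    """Three-input majority function: 1 when at least two inputs are 1."""
    return (a & b) | (b & c) | (a & c)

def sed_ack(asa_bits):
    """Return 1 (ACK raised) when the three ASA outputs form an invalid code.

    Valid codes contain at most one '1': [000], [001], [010], [100].
    """
    a, b, c = asa_bits
    return majority(a, b, c)

if __name__ == "__main__":
    for bits in ([0, 0, 0], [0, 0, 1], [0, 1, 1], [1, 1, 1]):
        print(bits, "ACK =", sed_ack(bits))
```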

## 4.4 Overall Framework

Integrating the above three design techniques, we present the overall PDS framework for STT-MRAM, as shown in Fig. 12. Herein we consider both the 1T1MTJ (i.e., to form the 3T3MTJ cell cluster) and 1T3MTJ cell structures in the proposed PDS framework, as shown in Figs. 12c and 12d, respectively. With the 1T3MTJ cell cluster, more routing overhead is induced in a memory array, as three BLs are required for each memory cell. However, the 1T3MTJ cell can store 2 bits of data with only one transistor. In general, the total area of the transistors is much larger than that of the BLs in a memory chip. Therefore, the overall area efficiency is not degraded with the 1T3MTJ cell structure. Figs. 12e and 12f show the layouts of the 1T1MTJ cell array $(2 \times 4)$ and the 1T3MTJ cell array $(2 \times 2)$, respectively, for storing 8 bits of data as an example. As can be seen, the 1T3MTJ cell array occupies a smaller area (by $\sim$25.2 percent) than the 1T1MTJ cell array for storing 8 bits of data. The main differences of the PDS framework from the typical 1T1MTJ cell (see Fig. 12a) based STT-MRAM are the memory cell array, read circuit, data representation and memory controller. During write operations, the input data bit sequence is first truncated into (2-bit) data symbols, and the data symbols are then stored in the cell clusters through the write driver. During read operations, the data symbols stored in the cell clusters are read out by the ASAs and then mapped back to the original data bit sequence before entering the input/output (I/O) module. If an error is detected by the SEDC module, an acknowledgement signal is generated and transferred to the memory controller. These processes are carried out by the memory controller and are transparent to the users.
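The write-side truncation into 2-bit symbols and the read-side mapping back to a bit sequence can be sketched as follows. Only the pair (01) and [001] is taken from the example in Section 4.3; the remaining symbol-to-code assignments, the function names and the requirement of an even-length bit string are assumptions made for illustration.

```python
# Hypothetical symbol table: only "01" <-> (0, 0, 1) is taken from the text.
SYMBOL_TO_CODE = {
    "00": (0, 0, 0),
    "01": (0, 0, 1),
    "10": (0, 1, 0),
    "11": (1, 0, 0),
}
CODE_TO_SYMBOL = {code: symbol for symbol, code in SYMBOL_TO_CODE.items()}

def to_symbols(bits):
    """Truncate a bit string into 2-bit data symbols (write path)."""
    if len(bits) % 2:
        raise ValueError("bit string length must be even")
    return [bits[i:i + 2] for i in range(0, len(bits), 2)]

def read_back(codes):
    """Map ASA output codes back to bits and collect positions of invalid codes."""
    symbols, error_positions = [], []
    for position, code in enumerate(codes):
        symbol = CODE_TO_SYMBOL.get(tuple(code))
        if symbol is None:              # more than one '1': SEDC would raise ACK
            error_positions.append(position)
            symbol = "??"
        symbols.append(symbol)
    return "".join(symbols), error_positions

if __name__ == "__main__":
    written = [SYMBOL_TO_CODE[s] for s in to_symbols("01101100")]
    print(read_back(written))
```

Eight bits map onto four cell clusters, in line with the $2 \times 2$ 1T3MTJ array used as the 8-bit layout example.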

# 5 EXPERIMENTAL EVALUATIONS

### 5.1 Cross-Layer Evaluation Platform

In this section, we provide comprehensive evaluations of the proposed PDS framework. The cross-layer evaluation platform mainly includes three parts: STT-MTJ SPICE modeling at device level, cell cluster and ASA designs at circuit level, as well as memory and processor configurations at architecture level, as shown in Fig. 13. The STT-MTJ device was modelled by solving the LLG equation (see Eq. (5)) and was implemented in the Verilog-A language. The key parameters and constants of the electrical model



Fig. 13. Overview of the cross-layer evaluation platform.

TABLE 2
Parameters and Constants of the STT-MTJ Model

| Parameter | Description | Default Value |
|-----------|-------------|---------------|
| $M_s$ | Saturation magnetization | $0.625 \times 10^6$ A/m |
| $d$ | Diameter of MTJ | 40 nm |
| $t_{ox}$ | Oxide layer thickness | 0.85 nm |
| $t_f$ | Free layer thickness | 1.1 nm |
| $\alpha$ | Gilbert damping factor | 0.027 |
| $P$ | Spin polarization | 0.56 |
| $T$ | Temperature | 300 K |
| TMR | TMR ratio | 150% |
| $\Delta$ | Thermal stability | 60 |
| $I_C$ | Critical switching current | 50 $\mu$A |
| **Constant** | **Description** | **Default Value** |
| $\gamma$ | Gyromagnetic ratio | $2.21276 \times 10^5$ m/(A$\cdot$s) |
| $\mu_0$ | Vacuum permeability | $1.2566 \times 10^{-6}$ H/m |
| $k_B$ | Boltzmann constant | $1.38 \times 10^{-23}$ J/K |
| $e$ | Elementary charge | $1.6 \times 10^{-19}$ C |

are listed in Table 2. This electrical model can be adapted to different parameters, and the parameters used in this paper are taken from recent mainstream experimental results. The circuit level design and evaluations were performed on the Cadence platform with the developed STT-MTJ SPICE model and a 40 nm CMOS design-kit. The architecture level evaluations were carried out with NVSim [38] and Gem5 [39], with which the cache memory and processor configurations are simulated. This platform can be used to explore STT-MRAM for cache memory in microprocessors. The cache latency and energy parameters are simulated with NVSim [38], which is modified to adopt the PDS framework. Furthermore, the Gem5 simulator was modified correspondingly. The overhead of the decoders and write drivers is included in the simulations. Under the syscall emulation (SE) mode, the baseline configuration is shown in Table 3. A single core with 32 KB L1 data cache/32 KB L1 instruction cache and 1 MB L2 cache was chosen to evaluate the performance of various cache configurations. We selected 14 benchmarks (see Table 3) from

TABLE 3 Simulation Setup

| Component      | Configuration                                                                                                                                                                                          |
|----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| CPU            | Single core, 2 GHz, out-of-order                                                                                                                                                                       |
| L1             | Inst./Data 32K Byte/32K Byte, 64 Byte line,<br>2-way, 1 bank, Write-back<br>SRAM: Lat. 2 Cycles                                                                                                        |
| L2             | 2M, 64 Byte line, 8-way, 1 bank, Write-back<br>SRAM: Lat. 9 Cycles<br>1T1MTJ: Lat. (R/W) 10/16 Cycles<br>2T2MTJ: Lat. (R/W) 10/16 Cycles<br>3T3MTJ: Lat. (R/W) 10/16 Cycles<br>1T3MTJ: Lat. (R/W) 10/16 Cycles |
| Execution Unit | 2x ALU, 2x CALU, 2x FPU for each core                                                                                                                                                                  |
| Main Memory    | 8 GB DDR3 1,600 MHz, 120 Cycles, 12.8 GB/s.                                                                                                                                                            |
| Workload       | Lbm, Mcf, Soplex, Libquantum, Leslie3d, Milc,<br>Bzip2, Hmmer, Gromacs, Namd, Perlbench,<br>Povray, Sjeng, Gobmk                                                                                       |



Fig. 14. Transient waveforms of the proposed ASA when performing read operations of the cell cluster.

SPEC CPU 2006 for testing and performed two billion instructions for each benchmark in the simulations.
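A few derived quantities follow directly from the device parameters in Table 2 and help to sanity-check the model setup. The sketch below is illustrative: it assumes a circular junction, the usual definitions $\Delta = E_b/(k_B T)$ and $\text{TMR} = (R_{AP} - R_P)/R_P$, and variable names chosen here for convenience.

```python
import math

# Values copied from Table 2.
D_MTJ = 40e-9        # MTJ diameter (m)
T_FREE = 1.1e-9      # free layer thickness (m)
I_C = 50e-6          # critical switching current (A)
TMR = 1.5            # TMR ratio (150 %)
DELTA = 60           # thermal stability factor
K_B = 1.38e-23       # Boltzmann constant (J/K)
TEMP = 300           # temperature (K)
E_CHARGE = 1.6e-19   # elementary charge (C)

area = math.pi * D_MTJ ** 2 / 4      # junction area (m^2), circular shape assumed
volume = area * T_FREE               # free layer volume (m^3)
j_c = I_C / area                     # critical current density (A/m^2)
e_barrier = DELTA * K_B * TEMP       # energy barrier (J), from Delta = E_b / (k_B T)
r_ratio = 1 + TMR                    # R_AP / R_P, from TMR = (R_AP - R_P) / R_P

if __name__ == "__main__":
    print(f"area      = {area:.3e} m^2")
    print(f"volume    = {volume:.3e} m^3")
    print(f"J_C       = {j_c:.3e} A/m^2")
    print(f"E_barrier = {e_barrier:.3e} J = {e_barrier / E_CHARGE:.2f} eV")
    print(f"R_AP/R_P  = {r_ratio}")
```

Under these assumptions the junction area is about $1.26 \times 10^{-15}$ m$^2$, the critical current density is about $4 \times 10^{10}$ A/m$^2$, the energy barrier is about 1.55 eV, and the AP resistance is 2.5 times the P resistance.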

# 5.2 Device and Circuit Level Evaluations

Using the developed STT-MTJ SPICE model and a CMOS design-kit, we designed the 1T3MTJ cell cluster, ASA and the SEDC module for STT-MRAM in the 40 nm technology node. We then carried out transient and Monte-Carlo simulations to demonstrate the performance of the proposed PDS framework. Fig. 14 illustrates the transient waveforms of the ASA when performing write/read operations of the cell cluster. Here "R" and "W" denote the read and write cycles respectively. As can be seen, the output results of the three ASAs are consistent with those listed in Table 1, validating the functionality of the cell cluster and the ASA designs. Furthermore, with Monte-Carlo simulations, Fig. 15 shows



Fig. 15. Monte-Carlo simulations of the RM distributions of the (a) typical SA, (b) ASA with input states {P-P} or {AP-AP}, (c) ASA with input states {P-AP} and (d) ASA with input states {AP-P}.



Fig. 16. Capacity scaling of our proposed PDS scheme, including the 3T3MTJ and 1T3MTJ, normalized to the baseline SRAM-based cache.

the distributions of RM for the proposed ASA and the typical SA. As can be seen, for the ASA, the case with input MTJ states {P-AP} achieves the largest RM, while the cases with input MTJ states {P-P} and {AP-AP} result in the smallest RM, which is consistent with our analyses in Section 4.2. Furthermore, the average RM of the proposed ASA, even in the worst case, is much better (~35.6 percent) than that of the typical SA, which demonstrates the effectiveness of the proposed design.
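What such an RM distribution represents can be illustrated with a toy Monte-Carlo model. This is not the SPICE-level simulation behind Fig. 15 and does not reproduce its numbers: the mean input voltages, the Gaussian spreads, the offset and the function name are all hypothetical values chosen only to show how the per-case margins are sampled.

```python
import random
import statistics

def sample_margins(left, right, expected, n=10000, delta_v=50.0, seed=0):
    """Signed read margins (mV) for one ASA input case under Gaussian variation.

    A positive margin means the ASA resolves to the expected bit.
    """
    rng = random.Random(seed)
    mean = {"P": 100.0, "AP": 200.0}   # hypothetical input voltages (mV)
    sigma = {"P": 5.0, "AP": 10.0}     # hypothetical spread (5 % of the mean)
    margins = []
    for _ in range(n):
        v_left = rng.gauss(mean[left], sigma[left])
        v_right = rng.gauss(mean[right], sigma[right]) + delta_v
        diff = v_right - v_left        # diff > 0 resolves to "0"
        margins.append(diff if expected == 0 else -diff)
    return margins

if __name__ == "__main__":
    cases = [("P", "P", 0), ("AP", "AP", 0), ("P", "AP", 0), ("AP", "P", 1)]
    for left, right, expected in cases:
        m = sample_margins(left, right, expected)
        failures = sum(1 for x in m if x <= 0) / len(m)
        print(f"{{{left}-{right}}}: mean={statistics.mean(m):.1f} mV, "
              f"std={statistics.stdev(m):.1f} mV, fail rate={failures:.4f}")
```

In this toy setting the {P-AP} case has the largest mean margin and the same-state cases the smallest, which is the qualitative ordering described above.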

## 5.3 Architecture Level Evaluations

First, we evaluate the performance of the proposed PDS scheme for application of cache with the NVSim simulator.

Capacity Scaling. The capacity scaling behaviors of the proposed PDS scheme in cache memory are shown in Fig. 16, normalized to the baseline SRAM-based cache. As can be seen, the performance in terms of area, energy and power of both the proposed PDS-based (including the 3T3MTJ and 1T3MTJ cell cluster structures) and the typical 1T1MTJ based STT-MRAM caches improves as the capacity increases. Moreover, with larger capacities (e.g., > 1 MB), the STT-MRAM based caches exhibit better performance than the conventional SRAM-based one. In particular, our proposed PDS scheme with the 1T3MTJ cell cluster is the optimal choice in terms of area, read energy, read latency, write latency, write energy and leakage power.

Area. It is expected that there is an optimal capacity for the cache design to increase hit rate. When the cache has a small capacity (e.g., < 512 KB), the peripheral circuits occupy the major area of the chip. As the capacity increases, the cache area is dominated by the memory cell array. Fig. 16a shows the comparison of five caches. As can be seen, when the capacity is above 1 MB, the STT-MRAM based caches have better area-efficiency than the SRAM based one. This is because the memory bit-cell area of STT-MRAM is smaller than that of SRAM (6T). In particular, the proposed PDS scheme with the 1T3MTJ cell cluster has the best area-efficiency, about $4\times(2\times)$ improvement compared to the SRAM (or typical 1T1MTJ)-based caches.

Read/Write Performance. The read latency, read energy, write latency and write power are shown in Figs. 16b, 16c, 16d, and 16e, respectively. The simulation results indicate that the read and write performances of the STT-MRAM based caches are inferior to the SRAM-based one when the capacity is relatively small, as the SRAM bit-cell has much lower power and faster access speed than those of STT-MRAM. As capacity increases, however, because of the larger bit-cell area, the overall chip area of the SRAM based cache increases more quickly than those of the STT-MRAM based ones. In this case, the signal transmission power and latency increase dramatically, resulting in rapid growth of overall power and latency for the SRAM based cache. For the read performance, the proposed PDS scheme with the 1T3MTJ cell cluster outperforms SRAM when the capacity is over 1 MB. For the write performance, the write latency of the STT-MRAM bit-cell is much higher than that of SRAM; therefore the advantage appears only if the capacity exceeds 16 MB. On the other hand, the write power outperforms SRAM when the capacity exceeds 512 KB.

Leakage Power. As the MTJ devices are nonvolatile, most of the leakage power in STT-MRAM based caches originates from the peripheral CMOS circuits. As can be seen from Fig. 16f, the STT-MRAM based caches achieve much lower leakage power (~830 percent at 8 MB) compared to the SRAM based one, and the advantage becomes more pronounced as the capacity grows.

In summary, the proposed PDS scheme with the novel 1T3MTJ cell cluster for the STT-MRAM based cache exhibits promising potential in replacing conventional SRAM-based cache, especially in relatively large capacities. To confirm it, we provide further evaluations to show the efficiency of the proposed PDS scheme by integrating the cache memory into the processor with the Gem5 simulator.

Execution Time with Capacity of 1 MB. We first evaluate the execution time with the same capacity, normalized to the baseline SRAM, for different benchmarks, as shown in Fig. 17. The average execution time of the proposed PDS scheme with the 1T3MTJ cell cluster is similar to that of SRAM, but with an improvement compared to the typical 1T1MTJ cell structure.

Energy Consumption with Capacity of 1 MB. Fig. 18 shows the energy comparison results, normalized to the SRAM based cache. The simulation results demonstrate that the proposed PDS scheme with the 1T3MTJ cell cluster reduces energy by  $\sim$ 32.9 percent ( $\sim$ 22.6 percent) on average, compared with the SRAM (or typical 1T1MTJ) based cache.

*IPC with the Same Chip Area.* We also compare the instructions per cycle (IPC) with the same chip area, normalized to



Fig. 17. Normalized execution time with the same capacity of 1 MB.

the SRAM based cache (1 MB). As shown in Fig. 19, the proposed PDS scheme with the 1T3MTJ cell cluster has  $\sim\!\!1.3$  percent ( $\sim\!\!3.6$  percent) improvement compared with the SRAM (or typical 1T1MTJ) based cache. Even though the SRAM has lower write and read latencies than STT-MRAM (see Figs. 16b and 16d), the proposed PDS scheme achieves a higher capacity with the same area, resulting in improved IPC.

Miss Rate with the Same Chip Area. In Fig. 20, we compare the miss rate with the same chip area, normalized to the SRAM based cache (1 MB). On average, the proposed PDS scheme with the 1T3MTJ cell cluster has an improvement of  $\sim$ 36.9 percent ( $\sim$ 34.6 percent) compared with the SRAM (or typical 1T1MTJ) based cache. The improved miss rate reduces the operations for fetching data from the next-level cache.

*Summary*. These results confirm our motivation of utilizing the proposed PDS (1T3MTJ)-based STT-MRAM in place of SRAM for cache applications.

# 6 RELATED WORK

As introduced previously, dynamic write power and read reliability concerns have become two critical challenges for STT-MRAM development and applications. Many related studies have tried to solve these challenges. In addition, a number of studies have employed STT-MRAM for cache design through system-level modeling and simulation. In this section, we provide a brief introduction and analysis of these related studies.

Most studies deal with these two concerns separately. For reducing the dynamic write power, device, circuit as well as architecture techniques have been proposed in the literature. At device level, MTJs with low critical switching current or thermal stability factor were designed [14], [31], or field-assisted writing mechanisms were utilized [15], [40], to reduce the write power. Meanwhile, an asymmetrically doped



Fig. 18. Normalized energy with the same capacity of 1 MB.



Fig. 19. Normalized IPC with the same chip area.

transistor was designed to provide asymmetrical current drivability considering the asymmetry property of the 1T1MTJ cell structure [41]. At circuit level, voltage boosting techniques and self-terminated write drivers [16], [42] were proposed to address the time-to-time and cell-to-cell variations due to the impact of PVT variations. At architecture level, read-before-write and bit-flipping algorithms [17], [43] were proposed to reduce the write activities for unnecessary write operations.

For improving the read reliability, the key challenge is to obtain a good trade-off between the RM and RD. Many read circuits have been designed, which can be classified into three categories: typical sensing, self-reference sensing and DS schemes. In the typical sensing scheme, a reference cell is required, and many studies have been proposed to optimize the reference cell [24], the sensing amplifier [11] as well as the array structure [44]. Jung's group at Yonsei University proposed a number of read circuit designs within this category [45], [46]. In the self-reference sensing scheme, including destructive ones [20] as well as non-destructive ones [21], no reference cell is needed. However, the destructive self-reference sensing schemes require writing the data back into the memory cells, leading to dynamic power wastage. The DS scheme is able to double the RM with no additional RD; however, two complementary memory cells are required, resulting in area overhead [22]. The DS scheme is generally used in logic instead of memories. In addition, error correction codes (ECC) are also used to improve the reliability of STT-MRAM [13], [47].

Regarding employing STT-MRAM for cache design through system-level modeling and simulation, many studies have been done in literature. Some previous studies simply replace (fully or partly) SRAM with STT-MRAM to reduce static leakage power [48], [49], [50], [51] which, however, results in little performance benefit because of the



Fig. 20. Normalized miss rate with the same chip area.

dynamic write/read power and latency overhead of STT-MRAM. Some other works try to address the write power and read reliability concerns of STT-MRAM through system-level techniques, such as dynamic data management, and power control etc. [52], [53], [54], [55]. Differently, our work proposes a novel cross-layer PDS framework to address these concerns of STT-MRAM. The proposed PDS framework, which is the primary contribution of this paper, can greatly facilitate STT-MRAM for cache applications.

In summary, prior works focus only on either the dynamic write power or the read reliability of STT-MRAM. They ignore the fact that these two concerns are actually interrelated. In addition, they mostly take a single-layer design point of view. In contrast, our proposed PDS scheme deals with the two concerns simultaneously with a cross-layer co-design strategy.

# 7 CONCLUSION

Dynamic write power and read reliability concerns are two key challenges for practical applications of STT-MRAM. Traditional techniques deal with these two challenges separately and can hardly achieve an overall performance improvement. In this work, we proposed a novel PDS framework, which is a synergistic solution integrating three design techniques, i.e., cell cluster, ASA and SEDC, to jointly address these two concerns of STT-MRAM. In addition, a new data representation method, named SR-mapping, was designed to coordinate with the proposed PDS framework. By developing an STT-MTJ SPICE model, we performed circuit level simulations to evaluate the functionality of the proposed PDS framework. After that, we carried out architecture level evaluations by integrating the proposed PDS scheme into the cache and processor configurations. Our experimental results show that the proposed PDS framework with the 1T3MTJ cell cluster significantly improves the reliability and performance compared with the typical 1T1MTJ cell structure. Meanwhile, it has the potential for high density, high speed and low power in cache applications in place of conventional volatile SRAM.

#### **ACKNOWLEDGMENTS**

This work was supported by the China Postdoctoral Science Foundation (2015M570024), and the National Natural Science Foundation of China (61501013 and 61571023). Weisheng Zhao is the corresponding author of this paper.

#### REFERENCES

- [1] N. S. Kim, "Leakage current: Moore's law meets static power," *Computer*, vol. 36, no. 12, pp. 68–75, Dec. 2003.
- [2] L. Li, et al., "Leakage energy management in cache hierarchies," in *Proc. Int. Parallel Architectures Compilation Techn.*, 2002, pp. 131–140.
- [3] P. Hammarlund, et al., "Haswell: The fourth-generation Intel core processor," *IEEE Micro*, vol. 34, no. 2, pp. 6–20, Mar./Apr. 2014.
- [4] C. J. Xue, G. Sun, Y. Zhang, J. J. Yang, Y. Chen, and H. Li, "Emerging non-volatile memories: Opportunities and challenges," in *Proc. IEEE/ACM/IFIP Int. Conf. Hardware/Softw. Codesign Syst. Synthesis*, 2011, pp. 325–334.
- [5] H.-S. P. Wong and S. Salahuddin, "Memory leads the way to better computing," *Nat. Nanotechnol.*, vol. 10, pp. 191–194, 2015.
- [6] G. W. Burr, B. N. Kurdi, J. C. Scott, C. H. Lam, K. Gopalakrishnan, and R. S. Shenoy, "Overview of candidate device technologies for storage-class memory," in *IBM J. Res. Dev.*, vol. 52, no. 4.5, pp. 449–464, 2008.

- [7] C. Chappert, A. Fert, and F. N. Van Dau, "The emergence of spin electronics in data storage," *Nat. Mater.*, vol. 6, no. 11, pp. 813–823, 2007.
- [8] W. Kang, et al., "Spintronics: Emerging ultra-low-power circuits and systems beyond MOS technology," *ACM J. Emerg. Technol. Comput. Syst.*, vol. 12, no. 2, 2015, Art. no. 16.
- [9] D. E. Nikonov and I. A. Young, "Overview of beyond-CMOS devices and a uniform methodology for their benchmarking," *Proc. IEEE*, vol. 101, no. 12, pp. 2498–2533, Dec. 2013.
- [10] D. E. Nikonov and I. A. Young, "Benchmarking spintronic logic devices based on magnetoelectric oxides," J. Materials Res., vol. 29, no. 18, pp. 2109–2115, 2014.
- [11] W. Kang, et al., "Variation-tolerant and disturbance-free sensing circuit for deep nanometer STT-MRAM," *IEEE Trans. Nanotechnol.*, vol. 13, no. 6, pp. 1088–1092, Nov. 2014.
- [12] Y. Zhang, Y. Li, Z. Sun, H. Li, Y. Chen, and A. K. Jones, "Read performance: The newest barrier in scaled STT-RAM," IEEE Trans. Very Large Scale Integr. Syst, vol. 23, no. 6, pp. 1170–1174, Jun. 2015.
- [13] W. Kang, et al., "Yield and reliability improvement techniques for emerging nonvolatile STT-MRAM," *IEEE J. Emerg. Sel. Topics Circuits Syst.* vol. 5, no. 1, pp. 28–39, Mar. 2015.
- [14] D. Saida, et al., "Low-current high-speed spin-transfer switching in a perpendicular magnetic tunnel junction for cache memory in mobile processors," *IEEE Trans. Magn.*, vol. 50, no. 11, pp. 1–5, Nov. 2014.
- [15] E. Eken, Y. Zhang, W. Wen, R. Joshi, H. Li and Y. Chen, "A new field-assisted access scheme of STT-RAM with self-reference capability," in *Proc.* 51st Annu. Des. Autom. Conf., 2014, pp. 1–6.
- [16] P. Zhou, B. Zhao, J. Yang, and Y. Zhang, "Energy reduction for STT-RAM using early write termination," in *Proc. IEEE/ACM Int. Conf. Comput.-Aided Des.*—Dig. Tech. Papers, 2009, pp. 264–268.
- [17] R. Bishnoi, F. Oboril, M. Ebrahimi, and M. B. Tahoori, "Avoiding unnecessary write operations in STT-MRAM for low power implementation," in *Proc. 15th Int. Symp. Quality Electron. Des.*, 2014, pp. 548–553.
- [18] G. Sun, C. Zhang, P. Li, T. Wang, and Y. Chen, "Statistical cache bypassing for non-volatile memory," *IEEE Trans. Comput.*, 2016, Doi: 10.1109/TC.2016.2529621.
- [19] International technology roadmap for semiconductor (ITRS), "Emerging research device chapter," 2012, [Online]. Available: http://www.itrs.net.
- [20] H. Tanizaki, et al., "A high-density and high-speed 1T-4MTJ MRAM with voltage offset self-reference sensing scheme," in Proc. IEEE Asian Solid-State Circuits Conf., 2006, pp. 303–306.
- [21] Y. Chen, H. Li, X. Wang, W. Zhu, W. Xu, and T. Zhang, "A nondestructive self-reference scheme for Spin-Transfer Torque Random Access Memory (STT-RAM)," in *Proc. IEEE Des. Autom. Test Europe Conf. Exhibition*, 2010, pp. 148–153.
- [22] W. Kang, et al., "Separated precharge sensing amplifier for deep submicrometer MTJ/CMOS hybrid logic circuits," *IEEE Trans. Magn.*, vol. 50, no. 6, pp. 1–5, Jun. 2014.
- [23] H. Noguchi, et al., "Variable nonvolatile memory arrays for adaptive computing systems," in *Proc. IEEE Int. Electron Devices Meeting*, 2013, pp. 25.4.1–25.4.4.
- [24] W. Kang, W. Zhao, J. O. Klein, Y. Zhang, C. Chappert, and D. Ravelosona, "High reliability sensing circuit for deep submicron spin transfer torque magnetic random access memory," *Electron. Lett.*, vol. 49, no. 20, pp. 1283–1285, 2013.
- [25] J. C. Slonczewski, "Current-driven excitation of magnetic multilayers," J. Magnetism Magn. Mater., vol. 159, no. 1/2, pp. L1–L7, 1996.
- [26] K. C. Chun, H. Zhao, J. D. Harms, T. H. Kim, J. P. Wang, and C. H. Kim, "A scaling roadmap and performance evaluation of in-plane and perpendicular MTJ based STT-MRAMs for high-density cache memory," *IEEE J. Solid-State Circuits*, vol. 48, no. 2, pp. 598–610, Feb. 2013.
- [27] D. Apalkov, et al., "Spin-transfer torque magnetic random access memory (STT-MRAM)," J. Emerg. Technol. Comput. Syst., vol. 9, no. 2, May 2013, Art. no 13.
- [28] X. Fong, et al., "Spin-transfer torque devices for logic and memory: Prospects and perspectives," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 35, no. 1, pp. 1–22, Jan. 2016.
- [29] S. Ikeda, et al., "Perpendicular-anisotropy CoFeB-MgO based magnetic tunnel junctions scaling down to 1X nm," in Proc. IEEE Int. Electron Devices Meeting, 2014, pp. 33.2.1–33.2.4.
- [30] M. Gajek, et al., "Spin torque switching of 20 nm magnetic tunnel junctions with perpendicular anisotropy," *Appl. Phys. Lett.*, vol. 100, 2012, Art. no. 132408.

- [31] R. Sbiaa, S. Y. H. Lua, R. Law, H. Meng, R. Lye, and H. K. Tan, "Reduction of switching current by spin transfer torque effect in perpendicular anisotropy magneto-resistive devices," *J. Appl. Phys.*, vol. 109, no. 7, 2011, Art. no. 07C707.
- [32] Y. Zhang, et al., "Compact model of subvolume MTJ and its design application at nanoscale technology nodes," *IEEE Trans. Electron Devices*, vol. 62, no. 6, pp. 2048–2055, Jun. 2015.
- [33] Y. Zhang, X. Wang, Y. Li, A. K. Jones and Y. Chen, "Asymmetry of MTJ switching and its implication to the STT-RAM designs," in Proc. IEEE Des. Autom. Test Europe Conf. Exhibition, 2012, pp. 1313–1318.
- [34] D. V. Berkov and J. Miltat, "Spin-torque driven magnetization dynamics: Micromagnetic modeling," *J. Magnetism Magn. Mater.*, vol. 320, no. 7, pp. 1238–1259, 2008.
- [35] M. C. Gaidis, et al., "Two-level BEOL processing for rapid iteration in MRAM development," *IBM J. Res. Dev.*, vol. 50, no. 1, pp. 41–54, 2006.
- [36] H. Noguchi, et al., "7.2 4MB STT-MRAM-based cache with memory-access-aware power optimization and write-verify-write/read-modify-write scheme," in *Proc. IEEE Int. Solid-State Circuits Conf.*, 2016, pp. 132–133.
- [37] K. W. Kwon, S. H. Choday, Y. Kim, and K. Roy, "AWARE (Asymmetric Write Architecture with REdundant blocks): A high write speed STT-MRAM cache architecture," *IEEE Trans. Very Large Scale Integr. Syst.*, vol. 22, no. 4, pp. 712–720, Apr. 2014.
- [38] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, "NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory," *IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst.*, vol. 31, no. 7, pp. 994–1007, Jul. 2012.
- [39] N. Binkert, et al., "The gem5 simulator," *SIGARCH Comput. Archit. News*, vol. 39, no. 2, pp. 1–7, 2011.
- [40] R. Patel, X. Guo, Q. Guo, E. Ipek, and E. G. Friedman, "Reducing switching latency and energy in STT-MRAM caches with field-assisted writing," *IEEE Trans. Very Large Scale Integr. Syst.*, vol. 24, no. 1, pp. 129–138, Jan. 2016.
- [41] S. H. Choday, S. K. Gupta, and K. Roy, "Write-optimized STT-MRAM bit-cells using asymmetrically doped transistors," *IEEE Electron Device Lett.*, vol. 35, no. 11, pp. 1100–1102, Nov. 2014.
- [42] S. Motaman, S. Ghosh, and N. Rathi, "Impact of process-variations in STTRAM and adaptive boosting for robustness," in *Proc. 2015 Des. Autom. Test Europe Conf. Exhibition*, 2015, pp. 1431–1436.
- [43] X. Luo, et al., "Enhancing lifetime of NVM-based main memory with bit shifting and flipping," in *Proc. IEEE 20th Int. Conf. Embedded Real-Time Comput. Syst. Appl.*, 2014, pp. 1–7.
- [44] W. Kang, L. Zhang, J. O. Klein, Y. Zhang, D. Ravelosona, and W. Zhao, "Reconfigurable codesign of STT-MRAM under process variations in deeply scaled technology," *IEEE Trans. Electron Devices*, vol. 62, no. 6, pp. 1769–1777, Jun. 2015.
- [45] J. Kim, K. Ryu, S. H. Kang, and S. O. Jung, "A novel sensing circuit for deep submicron spin transfer torque MRAM (STT-MRAM)," *IEEE Trans. Very Large Scale Integr. Syst.*, vol. 20, no. 1, pp. 181–186, Jan. 2012.
- [46] B. Song, T. Na, J. Kim, J. P. Kim, S. H. Kang, and S. O. Jung, "Latch offset cancellation sense amplifier for deep submicrometer STT-RAM," *IEEE Trans. Circuits Syst. I: Regular Papers*, vol. 62, no. 7, pp. 1776–1784, Jul. 2015.
- [47] W. Kang, et al., "A low-cost built-in error correction circuit design for STT-MRAM reliability improvement," *Microelectron. Rel.*, vol. 53, no. 9–11, pp. 1224–1229, 2013.
- [48] P. Chi, S. Li, Y. Cheng, Y. Lu, S. H. Kang, and Y. Xie, "Architecture design with STT-RAM: Opportunities and challenges," in *Proc. IEEE 21st Asia South Pacific Des. Autom. Conf.*, 2016, pp. 109–114.
- [49] T. Adegbija, "Exploring configurable non-volatile memory-based caches for energy-efficient embedded systems," in *Proc. 26th Edition Great Lakes Symp. VLSI*, 2016, pp. 157–162.
- [50] N. Kim, J. Ahn, W. Seo, and K. Choi, "Energy-efficient exclusive last-level hybrid caches consisting of SRAM and STT-RAM," in Proc. IFIP/IEEE Int. Conf. Very Large Scale Integr., 2015, pp. 183–188.
- [51] K. Ikegami, et al., "Low power and high density STT-MRAM for embedded cache memory using advanced perpendicular MTJ integrations and asymmetric compensation techniques," in *Proc. IEEE Int. Electron Devices Meeting*, 2014, pp. 28.1.1–28.1.4.
- [52] H. Y. Cheng, et al., "Dswitch: Write-aware dynamic inclusion property switching for emerging asymmetric memory technologies," Tech. Rep. PSU CSE16-004, pp. 1–10, 2016.

- [53] A. M. H. Monazzah, H. Farbeh, and S. G. Miremadi, "LER: Least-error-rate replacement algorithm for emerging STT-RAM caches," IEEE Trans. Device Mater. Rel., vol. 16, no. 2, pp. 220–226, Jun. 2016.
- [54] E. Arima, et al., "Immediate sleep: Reducing energy impact of peripheral circuits in STT-MRAM caches," in *Proc. 33rd IEEE Conf. Comput. Des.*, 2015, pp. 149–156.
- [55] W. K. Cheng, Y. H. Ciou, and P. Y. Shen, "Architecture and data migration methodology for L1 cache design with hybrid SRAM and volatile STT-RAM configuration," *Microprocessors Microsyst.*, vol. 42, pp. 191–199, 2016.



Wang Kang (S'12-M'15) received the BS and PhD degrees in microelectronics from Beihang University, Beijing, China, in 2009 and 2015, respectively. He is currently a post-doc in the Spintronics Interdisciplinary Center and the School of Computer Science and Engineering of Beihang University. His research interests include spintronics and its related VLSI design, advanced computer architecture as well as electronic design automation (EDA). He has published more than 40 journal and refereed conference papers in these areas. He has also served as a peer reviewer for several journals, including the *IEEE Transactions on Electron Devices*, the *IEEE Transactions on Very Large Scale Integration*, the *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, and the *IEEE Transactions on Nanotechnology*. He is a member of the IEEE.



Liang Chang (S'15) received the MS degree in microelectronics from Beihang University, Beijing, in 2014. He is currently working towards the PhD degree in the Spintronics Interdisciplinary Center and the School of Electronic Information and Engineering of Beihang University. His current research interests include reconfigurable circuit design and advanced computer architectures based on emerging non-volatile devices. He is a student member of the IEEE and CCF.



Zhaohao Wang (S'12-M'16) received the BS and MS degrees in microelectronics from Tianjin University and Beihang University, China, in 2009 and 2012, respectively, and the PhD degree in physics from the University of Paris-Sud, France, in 2015. His current research interests include the modeling of emerging non-volatile nano-devices, design of new non-volatile memories and logic circuits, as well as advanced computer architectures. He is a member of the IEEE.



Weifeng Lv received the PhD degree in computer science from Beihang University, Beijing, China, in 1998. He is currently a professor and the director of the School of Computer Science and Engineering, Beihang University. He is also the deputy director of the State Key Laboratory of Software Development. His current research interests include advanced network management, big data processing, advanced computer hardware and software. He has directed dozens of projects in advanced network management, mobile internet applications and information processing services.



Guangyu Sun received the BS and MS degrees from Tsinghua University, Beijing, in 2003 and 2006, respectively, and the PhD degree in computer science from Pennsylvania State University, in 2011. He is currently an assistant professor of CECA at Peking University, Beijing, China. His research interests include computer architecture, VLSI design as well as electronic design automation (EDA). He has published more than 60 journal and refereed conference papers in these areas. He has also served as a peer reviewer and technical referee for several journals, including *IEEE Micro*, the *IEEE Transactions on Very Large Scale Integration*, and the *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*. He is a member of the IEEE and CCF.



Weisheng Zhao (M'06, SM'14) received the MS degree in electrical engineering from the ENSEEIHT Engineering School, Toulouse, France, in 2004 and the PhD degree in physics from the University of Paris-Sud, Orsay, France, in 2007. From 2008 to 2009, he was with the embedded computing laboratory at CEA. From 2009 to 2013, he was with CNRS as a tenured research scientist. In 2013, he joined the Spintronics Interdisciplinary Center at Beihang University, where he is currently a professor. His research interests include spintronics, hybrid integration of nanoelectronic devices with CMOS circuits, as well as new non-volatile memories and logic. He has authored or co-authored more than 120 scientific papers. He is a senior member of the IEEE.
